A Polynomial Time Matching Algorithm of Structured Ordered Tree Patterns for Data Mining from Semistructured Data
Abstract
Tree structured data such as HTML/XML files are represented by rooted trees with ordered children and edge labels. Knowledge representations for tree structured data are important for discovering the interesting features that such data have. In this paper, as a representation of structural features, we propose a structured ordered tree pattern, called a term tree, which is a rooted tree pattern consisting of ordered children and structured variables. A variable in a term tree can be substituted by an arbitrary tree. Deciding whether given tree structured data has a given structural feature is a core problem in data mining of large tree structured data. We consider the problem of deciding whether a term tree t matches a tree T, that is, whether T is obtained from t by substituting some trees for the variables in t. This problem is called the membership problem for t and T. Given a term tree t and a tree T, we present an O(nN) time algorithm for solving the membership problem for t and T, where n and N are the numbers of vertices in t and T, respectively. We also report some experiments on applying our matching algorithm to a collection of real Web documents.
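To make the membership problem concrete, here is a minimal Python sketch of a matcher for a simplified model of term trees. It is not the O(nN) algorithm of the paper, whose details the abstract does not give. The assumptions are ours: a tree is a tuple of `(edge_label, subtree)` pairs in child order; a variable appears only as a leaf edge marked by the hypothetical constant `VAR`; and a variable stands for one or more consecutive child edges together with the subtrees below them. The names `VAR` and `matches` are illustrative only.

```python
from functools import lru_cache

VAR = None  # hypothetical marker: an edge labeled None is a variable


def matches(t, T):
    """Return True if T is obtainable from term tree t in the simplified model.

    Both t and T are tuples of (edge_label, subtree) pairs in child order.
    In t, a child may be (VAR, ()), a leaf variable (an assumption of this sketch).
    """
    n, N = len(t), len(T)

    @lru_cache(maxsize=None)
    def go(i, j):
        # Can pattern children t[i:] account for exactly the tree children T[j:]?
        if i == n:
            return j == N
        if j == N:
            return False  # pattern children remain but the tree has none left
        label, sub = t[i]
        if label is VAR:
            # The variable absorbs T[j], then either stops or keeps absorbing.
            return go(i + 1, j + 1) or go(i, j + 1)
        tree_label, tree_sub = T[j]
        return label == tree_label and matches(sub, tree_sub) and go(i + 1, j + 1)

    return go(0, 0)


# Term tree: edge "a" with a variable below it, then a variable, then edge "b".
t = (("a", ((VAR, ()),)), (VAR, ()), ("b", ()))

# The variable under "a" absorbs x and y; the second variable absorbs c and e.
T1 = (("a", (("x", ()), ("y", ()))), ("c", (("d", ()),)), ("e", ()), ("b", ()))
# Here the variable under "a" has nothing to absorb, so the match fails.
T2 = (("a", ()), ("c", ()), ("b", ()))

print(matches(t, T1), matches(t, T2))
```

The memoized helper `go` is a small dynamic program over pairs of child positions, which keeps the check polynomial in this restricted setting. Handling variables with subpatterns below them, as the general term tree definition allows, would need a richer recursion than this sketch provides.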
Similar resources
Online Algorithms for Mining Semi-structured Data Stream
In this paper, we study an online data mining problem for streams of semi-structured data such as XML data. Modeling semi-structured data and patterns as labeled ordered trees, we present an online algorithm, StreamT, that receives fragments of unseen, possibly infinite semi-structured data in document order through a data stream, and can return the current set of frequent patterns immediat...
An Effective Grammar-Based Compression Algorithm for Tree Structured Data
Many semistructured data such as HTML/XML files are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information, we can speed up such a heav...
Efficient Learning of Semi-structured Data from Queries
This paper studies the learning complexity of classes of structured patterns for HTML/ XML-trees in the query learning framework of Angluin. We present polynomial time learning algorithms for ordered gapped tree patterns, OGT, and ordered gapped forests, OGF, under the into-matching semantics using equivalence queries and subset queries. As a corollary, the learnability with equivalence and mem...
Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data
Information Extraction from semistructured data becomes more and more important. In order to extract meaningful or interesting contents from semistructured data, we need to extract common structured patterns from semistructured data. Many semistructured data have irregularities such as missing or erroneous data. A tag tree pattern is an edge labeled tree with ordered children which has tree str...
Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents
Many Web documents such as HTML files and XML files have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge lab...